Abstract

We have carefully instrumented a large portion of the population living in a university graduate dormitory by giving participants Android smart phones running our sensing software. In this paper, we propose the novel problem of predicting mobile application (known as "apps") installation using social networks and explain its challenges. Modern smart phones, like the ones used in our study, are able to collect different social networks using built-in sensors (e.g. the Bluetooth proximity network, the call log network, etc.). While this information is accessible to app market makers such as the iPhone AppStore, it has not yet been studied how app market makers can use this information for marketing research and strategy development. We develop a simple computational model to better predict app installation by using a composite network computed from the different networks sensed by phones. Our model also captures individual variance and exogenous factors in app adoption. We show the importance of considering all these factors in predicting app installations, and we observe the surprising result that app installation is indeed predictable. We also show that our model achieves the best results compared with generic approaches: our prediction results are four times better than random, and reach almost 45% prediction precision with 45% recall.
Introduction

[...] projects have demonstrated that [...] correlate with individual behaviors, such as obesity (Christakis and Fowler 2007) and diseases (Colizza et al. 2007), to name two. Many large-scale networks are analyzed, and this field is becoming increasingly popular (Eagle, Macy, and Claxton 2010) (Leskovec, Adamic, and [...]).

We are interested in studying network-based prediction of mobile application (referred to as "apps") installation, as the mobile application business is growing rapidly (Ellison 2010). The app market makers, such as the iPhone AppStore and the Android Market, run on almost all modern smart phones, and they have access to phone data and sensor data. As a result, app market makers can infer different types of networks, such as the call log network and the Bluetooth proximity network, from phone data. However, it remains an unknown yet important question whether these data can be used for app marketing. Therefore, in this paper we address the challenge of utilizing all the different network data obtained from smart phones for app installation prediction.

It is natural to speculate that there are network effects in users' app installation, but we eventually realized that it is very difficult to adopt existing tools from large-scale social network research to model and predict the installation of certain mobile apps for each user, due to the following facts:


1. The underlying network is not observable. While many projects assume phone call logs are true social/friendship networks (Zhang and Dantu 2010), others may use whatever network is available as the underlying social network. Researchers have discovered that the call network may not be a good approximation (Eagle and Pentland 2006). On the other hand, smart phones can easily sense multiple networks using built-in sensors and software: a) The call logs can be used to form phone call networks; b) Bluetooth radio can be used to infer proximity networks (Eagle and Pentland 2006); c) GPS data can be used to infer user moving patterns, and furthermore their working places and affiliations (Farrahi and Gatica-Perez 2010); d) Social network tools (such as the Facebook app and the Twitter app) can observe users' online friendship network. In this work, our key idea is to infer an optimal composite network, the network that best describes app installation, from multiple layers of different networks easily observed by modern smart phones, rather than assuming a certain network as the real social network explaining app installation.

2. The analysis of epidemics (Ganesh, Massoulie, and Towsley 2005) and Twitter networks (Yang and Leskovec 2010) is based on the fact that the network is the only mechanism for adoption. The only way to get the flu is to catch the flu from someone else, and the only way to retweet is to see the tweet message from someone else. For mobile apps, however, this is not true at all. Any user can simply open the AppStore (on iPhones) or the Android Market (on Android phones), browse the different lists of apps, and pick the one that appears most interesting to install, without peer influence. One big challenge, which makes modeling the spreading of apps difficult, is that one can install an app without any external influence and information. One major contribution of this paper is that we demonstrate it is still possible to build a tool to observe network effects with such randomness.

3. The individual behavioral variance in app installation is 
so significant that any network effect might possibly be 
rendered unobservable from the data. For instance, some 
geek users may try and install all hot apps on the market, 
while many inexperienced users find it troublesome even 
to go through the process of installing an app, and as a 
result they only install very few apps. 

4. There are exogenous factors in the app installation behav- 
iors. One particular factor is the popularity of apps. For in- 
stance, the Pandora Radio app is vastly popular and highly 
ranked in the app store, while most other apps are not. Our 
model takes this issue into account too, and we show that 
exogenous factors are important in increasing prediction 
precision. 

Classic diffusion models such as Granovetter's work (Granovetter and Soong 1983) are applicable to simulation, but lack data fitting and prediction powers. Statistical analyses used by social scientists, such as matched sample estimation (Aral, Muchnik, and Sundararajan 2009), are only for identifying network effects and mechanisms. Recent works in computer science for inferring network structure assume a simple diffusion mechanism, and are only applicable to artificial simulation data on real networks (Gomez Rodriguez, Leskovec, and Krause 2010) (Myers [...]). On the other hand, our work addresses the above issues in practical app marketing prediction. On the mobile-based behavioral prediction side, the closest research is the churn prediction problem in mobile networks (Richter, Yom-Tov, and Slonim 2010), which uses call logs to predict users' future decisions of switching mobile providers. To our knowledge, there are no other related works for similar problems.

Data 

We collected our data from March to July 2010 with 55 participants, who are residents of a married graduate student residence at a major US university. Each participant was given an Android-based cell phone with built-in sensing software developed by us. The software runs passively and does not interfere with the normal usage of the phone.

Our software is able to capture all call logs in the experiment period. We therefore obtained a call log network between all participants by treating participants as nodes and the number of calls between two nodes as the weight of the edge in between. The software also scans nearby phones and other Bluetooth devices every five minutes to capture the proximity network between individuals. The counts of Bluetooth hits are used as edge weights, similar to the call log network, as done by Eagle et al. (Eagle and Pentland 2006). We have also collected the affiliation network and the friendship network by deploying a survey, which lists all the participants and asks each one to list their affiliations (i.e. the academic department) and rate their relationships with everyone else in the study. We believe that for app market makers the affiliation network can also be inferred simply by using phone GPS/cell tower information, as shown by Farrahi et al. (Farrahi and Gatica-Perez 2010). However, this is not the focus of this work, and here we simply use survey data instead. Though the friendship network is also collected using surveys, we suggest that app market makers can obtain the friendship network from phones by collecting data from social networking apps such as the Facebook and Twitter apps. We summarize all the networks obtained from both phones and surveys in Table 1. We refer to all networks in Table 1 as candidate networks, and all candidate networks will be used to compute the optimal composite network. It should be noted that all networks are reciprocal in this work.
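The construction of a reciprocal, count-weighted network described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function name `build_weighted_network` and the `(i, j)` event format are assumptions made for the example.

```python
from collections import Counter

def build_weighted_network(events, num_users):
    """Return a symmetric num_users x num_users weight matrix (list of lists).

    events: iterable of (i, j) pairs, one per observed call or Bluetooth hit
    between participants i and j (0-based indices; hypothetical format).
    """
    counts = Counter()
    for i, j in events:
        if i == j:
            continue  # ignore self-loops
        counts[(min(i, j), max(i, j))] += 1  # undirected: merge both directions
    weights = [[0] * num_users for _ in range(num_users)]
    for (i, j), c in counts.items():
        weights[i][j] = c
        weights[j][i] = c  # reciprocal edge gets the same weight
    return weights

# Toy log: two calls between users 0 and 1, one between 0 and 2, one self-call.
call_events = [(0, 1), (1, 0), (0, 2), (2, 2)]
call_network = build_weighted_network(call_events, num_users=3)
```

Merging both directions of each pair is one way to obtain the reciprocal networks used in this work; a directed variant would keep `(i, j)` and `(j, i)` separate.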

We want to emphasize the fact that the network data we used in Table 1 are obtainable for app market makers such as the Apple iTunes Store, as they have access to phone sensors as well as user accounts. Therefore, our approach in this paper can be beneficial to them for marketing research, customized app recommendation and marketing strategy making.

Our built-in sensing platform constantly monitors the installation of mobile apps. Every time a new app is installed, this information is collected and sent back to our server within a day. Overall, we received a total of 821 apps installed by all 55 users. Among them, 173 apps have at least two users. For this analysis, we only look at app installations and ignore un-installations. We first present statistics for the apps in the study: In Fig. 1(a) we plot the distribution of the number of users installing each app. We discover that our data correspond very well with a power-law distribution with exponential cut-off. In Fig. 1(b) we plot the distribution of the number of apps installed per user, which fits well with an exponential distribution.

Fig. 1(a) and 1(b) give detailed insight into our dataset. Even with a small number of participants, the distribution characteristics are clearly observable. We find that apps have a power-law distribution of users, which suggests that most apps in our study community have a very small user pool, and very few apps have spread broadly. The exponential decay in Fig. 1(b) suggests that the variance across individual users is significant: there are users with more than 100 apps installed, and there are users with only a couple of apps.

Model 

In this section, we describe our novel model for capturing the app installation behaviors in networks. In the following content, $\mathbf{G}$ denotes the adjacency matrix for graph $G$. Each user is denoted by $u \in \{1, \ldots, U\}$. Each app is denoted by $a \in \{1, \ldots, A\}$. We define the binary random variable $x_u^a$ to represent the status of adoption (i.e. app installation): $x_u^a = 1$ if $a$ is adopted by user $u$, and $0$ if not.

As introduced in the previous section, the different social relationship networks that can be inferred by phones are denoted by $G^1, \ldots, G^M$. Our model aims at inferring an optimal composite network $G^{opt}$ with the most predictive power from all the candidate social networks. The weight of edge $e_{i,j}$ in graph $G^m$ is denoted by $w_{i,j}^m$. The weight of an edge in $G^{opt}$ is simply denoted by $w_{i,j}$.

| Network | Type | Source | Notation |
|---|---|---|---|
| Call Log Network | Undirected, Weighted | # of Calls | $G^c$ |
| Bluetooth Proximity Network | Undirected, Weighted | # of Bluetooth Scan Hits | $G^b$ |
| Friendship Network | Undirected, Binary | Survey Results (1: friend; 0: not friend) | $G^f$ |
| Affiliation Network | Undirected, Binary | Survey Results (1: same; 0: different) | $G^a$ |

Table 1: Network data used in this study.

Figure 1: Circles are real data, and lines are fitting curves. Left: Distribution of number of users for each app. Right: Distribution of number of apps each user installed.

Adoption Mechanism 

One basic idea of our model is the non-negative accumulative assumption, which distinguishes our model from other linear mixture models. We define $G^{opt}$ to be:

$$G^{opt} = \sum_m \alpha_m G^m, \quad \text{where } \forall m,\ \alpha_m \ge 0. \tag{1}$$

The intuition behind this non-negative accumulative assumption is as follows: if two nodes are connected by a certain type of network, their app installation behaviors may or may not correlate with each other. On the other hand, if two nodes are not connected by a certain type of network, the absence of the link between them should lead to neither a positive nor a negative effect on the correlation between their app installations. As shown in Table 2 in the experiment section, our non-negative assumption brings a significant performance increase in prediction. The non-negative assumption also makes the model stochastic and theoretically sound. We treat binary graphs as weighted graphs as well. Here $\alpha_1, \ldots, \alpha_M$ are the non-negative weights for each candidate network in describing the optimal composite network. We later refer to the vector $(\alpha_1, \ldots, \alpha_M)$ as the optimal composite vector. Our non-negative accumulative formulation is also similar to mixture matrix models in the machine learning literature (El-Yaniv, Pechyony, and Yom-Tov 2008).
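As a small illustration of Eq. 1, the composite adjacency matrix is a non-negative weighted sum of the candidate adjacency matrices. The sketch below uses plain nested lists; the helper name `composite_network` and the toy matrices are assumptions for the example, not part of the study's code.

```python
def composite_network(networks, alphas):
    """Combine candidate adjacency matrices as sum_m alpha_m * G^m (Eq. 1).

    networks: list of square matrices (lists of lists) of equal size.
    alphas: one non-negative weight per candidate network.
    """
    if len(networks) != len(alphas):
        raise ValueError("need exactly one weight per candidate network")
    if any(a < 0 for a in alphas):
        raise ValueError("composite weights must be non-negative")
    n = len(networks[0])
    return [
        [sum(a * g[i][j] for a, g in zip(alphas, networks)) for j in range(n)]
        for i in range(n)
    ]

# Toy example with two users: a weighted call network and a binary friendship network.
calls = [[0, 2], [2, 0]]
friends = [[0, 1], [1, 0]]
g_opt = composite_network([calls, friends], alphas=[0.5, 2.0])
```

Because every weight is non-negative, a missing edge in a candidate network can only contribute zero to the composite edge, which matches the intuition stated above.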

We continue to define the network potential $p_a(i)$:

$$p_a(i) = \sum_{j \in \mathcal{N}(i)} w_{i,j}\, x_j^a, \tag{2}$$

where the neighbor of node $i$ is defined by:

$$\mathcal{N}(i) = \{\, j \mid \exists m \text{ s.t. } w_{i,j}^m > 0 \,\}. \tag{3}$$

The potential $p_a(i)$ can also be decomposed into potentials from different networks:

$$p_a(i) = \sum_m \alpha_m \Big( \sum_{j \in \mathcal{N}(i)} w_{i,j}^m\, x_j^a \Big) = \sum_m \alpha_m\, p_a^m(i), \tag{4}$$

where $p_a^m(i)$ is the potential computed from one single candidate network. We can think of $p_a(i)$ as the potential of $i$ installing app $a$ based on the observations of its neighbors on the composite network. The definition here is also similar to incoming influence from adopted peers in many cascade models (Kempe, Kleinberg, and Tardos 2003).
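The potential and its decomposition over candidate networks can be checked numerically on a toy example. The following sketch is illustrative only: the function name `potential`, the two small matrices and the weights are all made up for the example.

```python
def potential(weights, adopted, i):
    """Potential of node i for one app: sum of w[i][j] * x[j] over other nodes j.

    Nodes that are not neighbors of i have zero weight, so summing over all j
    gives the same value as summing over the neighborhood of i.
    """
    return sum(w * x for j, (w, x) in enumerate(zip(weights[i], adopted)) if j != i)

# Toy candidate networks for three users (hypothetical values).
calls = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
friends = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
alphas = [0.5, 2.0]
adopted = [0, 1, 1]  # users 1 and 2 have installed the app

# Potential computed on the composite network.
composite = [
    [sum(a * g[i][j] for a, g in zip(alphas, (calls, friends))) for j in range(3)]
    for i in range(3)
]
p_composite = potential(composite, adopted, 0)

# The same potential as a weighted sum of per-network potentials.
p_decomposed = sum(a * potential(g, adopted, 0) for a, g in zip(alphas, (calls, friends)))
```

Both routes give the same number for user 0, which is why the per-network potentials can be precomputed once and reused while the composite weights are being learned.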
Finally, our conditional probability is defined as:

$$\mathrm{Prob}\big(x_u^a = 1 \mid x_{u'}^a : u' \in \mathcal{N}(u)\big) = 1 - \exp\big(-s_u - p_a(u)\big), \tag{5}$$

where $\forall u,\ s_u \ge 0$. $s_u$ captures the individual susceptibility to apps, regardless of which app. We use the exponential function for two reasons:

1. The monotonic and concave properties of $f(x) = 1 - \exp(-x)$ match recent research (Centola 2010), which suggests that the probability of adoption increases at a decreasing rate with increasing external network signals.

2. It forms a concave optimization problem during maximum likelihood estimation in model training.

As shown in the experiment section and based on our experiences, this exponential model yields the best performance.
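The concavity claim in the second reason can be verified with a short derivation. The shorthand $z$ and $g$ below is introduced only for this sketch. Write $z = s_u + \sum_m \alpha_m\, p_a^m(u)$, which is linear in the parameters $(s_u, \alpha_1, \ldots, \alpha_M)$.

```latex
% Adopters contribute g(z) = log(1 - e^{-z}) to the log-likelihood, for z > 0:
g'(z)  = \frac{e^{-z}}{1 - e^{-z}} = \frac{1}{e^{z} - 1}, \qquad
g''(z) = -\frac{e^{z}}{(e^{z} - 1)^{2}} < 0 .

% Non-adopters contribute log(e^{-z}) = -z, which is linear in z.
```

A concave function of a linear map is concave, and a sum of concave and linear terms is concave, so the log-likelihood is concave over the feasible region $s_u \ge 0$, $\alpha_m \ge 0$.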



Model Training 

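We move on to discuss model training. During the training phase, we want to estimate the optimal values for $\alpha_1, \ldots, \alpha_M$ and $s_1, \ldots, s_U$. We formalize it as an optimization problem by maximizing the sum of all conditional likelihoods.

Given all candidate networks, a training set composed of a subset of apps $\mathrm{TRAIN} \subset \{1, \ldots, A\}$, and $\{x_u^a : \forall a \in \mathrm{TRAIN},\ u \in \{1, \ldots, U\}\}$, we compute:

$$\arg\max_{s_1, \ldots, s_U,\, \alpha_1, \ldots, \alpha_M} f(s_1, \ldots, s_U, \alpha_1, \ldots, \alpha_M), \quad \text{subject to: } \forall u,\ s_u \ge 0,\ \forall m,\ \alpha_m \ge 0, \tag{6}$$

where:

$$f(s_1, \ldots, s_U, \alpha_1, \ldots, \alpha_M) = \log \Bigg[ \prod_{a \in \mathrm{TRAIN}} \prod_{u : x_u^a = 1} \mathrm{Prob}\big(x_u^a = 1 \mid x_{u'}^a : u' \in \mathcal{N}(u)\big) \prod_{u : x_u^a = 0} \Big(1 - \mathrm{Prob}\big(x_u^a = 1 \mid x_{u'}^a : u' \in \mathcal{N}(u)\big)\Big) \Bigg] \tag{7}$$

$$= \sum_{a \in \mathrm{TRAIN}} \Bigg[ \sum_{u : x_u^a = 1} \log\big(1 - \exp(-s_u - p_a(u))\big) - \sum_{u : x_u^a = 0} \big(s_u + p_a(u)\big) \Bigg]. \tag{8}$$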


This is a concave optimization problem. Therefore, a global optimum is guaranteed, and there exist efficient algorithms scalable to larger datasets (Boyd and Vandenberghe 2004). We use a MATLAB built-in implementation here, and it usually takes a few seconds during optimization in our experiments.

Compared with works on inferring networks (Gomez Rodriguez, Leskovec, and Krause 2010) (Myers [...]), our work is different as we compute $G^{opt}$ from existing candidate networks. In addition, we don't need any additional regularization term or tuning parameters in the optimization process.

We emphasize that our algorithm doesn't address the causality problem (Aral, Muchnik, and Sundararajan 2009) in network effects: i.e., we don't attempt to understand the different reasons why network neighbors have similar app installation behaviors. It can either be diffusion (i.e. my neighbor tells me), or homophily (i.e. network neighbors share the same interests and personality). Instead, our focus is on prediction of app installation, and we leave the causality problem as future work.

Virtual Network for Exogenous Factors

Obvious exogenous factors include the popularity and quality of an app. The popularity and quality of an app will affect the ranking and review of the app in the AppStore/AppMarket, and as a result lead to a higher or lower likelihood of adoption. We can model this by introducing a virtual graph $G^p$, which can be easily plugged into our composite network framework. $G^p$ is constructed by adding a virtual node $U+1$ and one edge $e_{U+1,u}$ for each actual user $u$. The corresponding weight of each edge $w_{U+1,u}$ for computing $p_a(u)$ is $C^a$, where $C^a$ is a positive number describing the popularity of an app. In our experiment, we use the number of installations of the app in this experimental community as $C^a$. We have been looking at other sources to obtain reliable estimates for $C^a$, but we found the granularity from public sources to be unsatisfying. In practice, for app market makers, we argue that $C^a$ can be easily obtained accurately by counting app downloads and app ranks.

The exogenous factors also increase accuracy in measuring network effects for a non-trivial reason: Consider a network of two nodes connected by one edge, where both nodes installed an app. If this app is very popular, then the fact that both nodes have this app may not imply a strong network effect. On the contrary, if this app is very uncommon, the fact that both nodes have this app implies a strong network effect. Therefore, introducing exogenous factors does help our algorithm better calibrate network weights.

Experiments

Our algorithm predicts the probability of adoption (i.e. installing an app) given the adoption status of a node's neighbors. $p_i \in [0, 1]$ denotes the predicted probability of installation, while $x_i \in \{0, 1\}$ denotes the actual outcome. The most common prediction measure is the Root Mean Square Error (RMSE $= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - x_i)^2}$). This measure is known to assess the prediction method's ability badly (Goel et al. 2010). Since in our dataset most users have installed very few apps, a baseline approach can simply predict the same small $p_i$ and still achieve very low RMSE.

For app marketing, the key objective is not to know the probability prediction for each app installation, but to rank and identify a sub-group of individuals who are more likely to appreciate and install certain apps compared with others. Therefore, we mainly adopt the approach in rank-aware measures from information retrieval practices (Manning et al. 2008). For each app, we rank the likelihood of adoption computed by prediction algorithms, and study the following factors:

a) Mean Precision at k (MP-k): We select the top k individuals with the highest likelihood of adoption as predicted adopters from our algorithms, and compute precision at k ($\frac{\#\text{ true adopters among } k \text{ predicted adopters}}{k}$). We average the precisions at k among all apps in the testing set to get MP-k. On average each app has five users in our dataset. Therefore, the default value for k is five in the following text. MP-k measures an algorithm's performance on predicting the most likely nodes.

b) Optimal $F_1$-score (referred to later simply as $F_1$ Score). The optimal $F_1$-score is computed by computing $F_1$-scores ($\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$) for each point on the Precision-Recall curve and selecting the largest $F_1$ value. Unlike MP-k, the optimal $F_1$ score is used to measure the overall prediction performance of our algorithms. For instance, $F_1 = 0.5$ suggests the algorithm can reach a 50% precision at 50% recall.
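The two rank-aware measures can be sketched for a single app as follows. The function names are hypothetical, and the sketch assumes the Precision-Recall curve is traced by cutting the ranked list after each position, which is one common convention.

```python
def precision_at_k(scores, adopted, k):
    """Fraction of true adopters among the k users with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda u: scores[u], reverse=True)
    return sum(adopted[u] for u in ranked[:k]) / k

def mean_precision_at_k(score_rows, adopted_rows, k=5):
    """MP-k: precision at k averaged over all test apps (one row per app)."""
    values = [precision_at_k(s, x, k) for s, x in zip(score_rows, adopted_rows)]
    return sum(values) / len(values)

def optimal_f1(scores, adopted):
    """Largest F1 over all cutoffs of the ranked list for one app."""
    ranked = sorted(range(len(scores)), key=lambda u: scores[u], reverse=True)
    total_true = sum(adopted)
    best, hits = 0.0, 0
    for cutoff, u in enumerate(ranked, start=1):
        hits += adopted[u]
        if hits == 0:
            continue  # precision and recall are both zero at this cutoff
        precision = hits / cutoff
        recall = hits / total_true
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy example: four users, two true adopters (users 0 and 3).
scores = [0.9, 0.1, 0.8, 0.4]
adopted = [1, 0, 0, 1]
p_at_2 = precision_at_k(scores, adopted, k=2)
best_f1 = optimal_f1(scores, adopted)
```

In this toy case the best cutoff is after the third ranked user, where precision is 2/3 and recall is 1.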

Prediction using Composite Network 

To begin with, we illustrate different design aspects for our 
algorithm. 

To demonstrate the importance of modeling both networks and individual variances in our model, we here demonstrate the prediction performance with five configurations using 5-fold cross-validation: a) to model both individual variance and network effects; b) to model both individual variance and network effects, but exclude the virtual network $G^p$ capturing exogenous factors; c) to model with only individual variance (by forcing $\alpha_m = 0, \forall m$ in Eq. 6); d) to model with only network effects (by forcing $s_u = 0, \forall u$); and e) to model with network effects while allowing the composite vector to be negative. The results are illustrated in Table 2.

We find the surprising result that app installations are highly predictable with individual variance and network information, as shown in Table 2. In addition, Table 2 supports all our modeling assumptions: both individual variance and network effects play important roles in the app installation mechanism, as do the exogenous factors modeled by $G^p$.

We also notice that while accuracy almost doubles, it is often impossible to observe this improvement using RMSE. Therefore, we will not report RMSE for the rest of the work.







Figure 2: We demonstrate the prediction performances using 
each single network here. For comparison, we also show the 
result of random guess, and the result using our approach, 
which combines all potential evidence. 

We now illustrate the prediction performance when our 
algorithm is only allowed to use one single network. The 
results are shown in Fig. 2. We find that except for the affiliation
network, almost all other networks predict well above
chance level. The call log network seems to achieve the best 
results. We conclude that while network effects are strong in 
app installations, a well-crafted model such as our approach 
can vastly increase the performance by computing the com- 
posite network and counting other factors in. 

Prediction Performance 

We now compare the performance of our model with some other implementations for prediction. As there is no other closely related work on app prediction with multiple networks, we here compare prediction performance with some alternative approaches we can think of.

Since it is practically difficult to observe every user's app installation behavior, in our experiments we also want to test the performance of each algorithm when the training set is small. In particular, we evaluate the performance of different implementations with two approaches for cross validation: 1) Normal-size training set: We randomly choose half of all the apps in the dataset as the training set, and test on the other half of the dataset. 2) Small-size training set: We randomly choose only 20% of all the app installations in our dataset as the training set, and test on the remaining 80% of apps. In both cases, we repeat the process five times for cross validation and take the average of the results.

For our algorithm, we feed it with the networks $G^p$, $G^c$, $G^b$, $G^f$ and $G^a$ obtained from phones and surveys as described previously. For SVM, we apply two different approaches in predictions:

• We don't consider the underlying network, but simply use the adoption status of all other nodes as the features for each node. We test this approach simply to establish a baseline for prediction. We refer to it as "SVM-raw".

• We compute the potential $p_a^m(i)$ for each candidate network $G^m$, and we use all the potentials from all candidate networks as features. Therefore, we partially borrow some ideas from our own model to implement this SVM approach. We refer to this approach as "SVM-hybrid".

We use a modern SVM implementation (Chang and Lin 2001), which is capable of generating probabilistic predictions rather than binary predictions.

We also replace Eq. 5 with a linear regression model by using $p_a^m(i), \forall m$ together with the number of apps per user (instead of learning $s_u$ in our MLE framework) as independent variables. We call this approach "Our Approach (Regression)" in the following text to distinguish it. We also enforce the non-negative accumulation assumption in the regression setting.

Results for both the normal-size training set and the small-size training set are shown in Table 3, and we discover that our algorithm outperforms other competing approaches in all categories. However, we notice that when they incorporate many of our model assumptions, generic methods can also achieve reasonably good results. Performance on the half of the users that are less active in app installation is also shown. Because this group of users is very inactive, they may be more susceptible to network influence in app installation behaviors. We notice that our algorithm performs better in this group, with more than 10% improvement over other methods.

| Configuration | RMSE | MP-5 | $F_1$ Score |
|---|---|---|---|
| Net. + Ind. Var. + Exogenous Factor | 0.25 | 0.31 | 0.43 |
| Net. + Ind. Var. | 0.26 | 0.29 | 0.42 |
| Ind. Variance Only | 0.29 | 0.097 | 0.24 |
| Net. Only (non-negative) | 0.26 | 0.24 | 0.37 |
| Net. Only (allow negative) | 0.30 | 0.12 | 0.12 |

Table 2: The performance of our approach under five different configurations. We observe that modeling both individual variance and networks is crucial for performance, as is enforcing non-negative composition for candidate networks as in Eq. 1.

| Methods | 20% Training Set, All Users: MP-5 | 20% Training Set, All Users: $F_1$ Score | 50% Training Set, All Users: MP-5 | 50% Training Set, All Users: $F_1$ Score | 50% Training Set, Low Activity Users: MP-5 | 50% Training Set, Low Activity Users: $F_1$ Score |
|---|---|---|---|---|---|---|
| Our Approach | 0.28 | 0.46 | 0.31 | 0.43 | 0.20 | 0.43 |
| SVM-raw | 0.17 | 0.26 | 0.24 | 0.32 | 0.14 | 0.27 |
| SVM-hybrid | 0.14 | 0.29 | 0.27 | 0.30 | 0.16 | 0.30 |
| Our Approach (Regression) | 0.27 | 0.42 | 0.30 | 0.41 | 0.18 | 0.39 |
| Random Guess | 0.081 | 0.17 | 0.081 | 0.17 | 0.076 | 0.14 |

Table 3: Prediction performance for our algorithm and competing methods is shown.

Predicting Future Installations

In app marketing, one key issue is to predict future app installations. Predicting future app adoption at time $t$ in our model is equivalent to predicting installation with part of the neighbor adoption status unknown. These unknown neighbors who haven't adopted at time $t$ may or may not adopt at $t' > t$. Though our algorithm is trained without information on the time of adoption, we show here that the inferred individual variance $s_u$ and composite vector $(\alpha_1, \ldots, \alpha_M)$ can be used to predict future app adoption.

We here apply the following cross-validation scheme to test our algorithm's ability in predicting future installations: For the adopters of each app, we split them into two equal-size groups by their time of adoption. Those who adopted earlier are in G1, and those who adopted later are in G2. The training phase is the same as in the previous section. In the testing phase, each algorithm will only see adoption information for subjects in G1, and predict node adoption for the rest. The nodes in G2 will be marked as non-adopters during the prediction phase.
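The split of adopters into an early group G1 and a late group G2 can be sketched as below. The function names and the `user -> timestamp` dictionary are assumptions for the example; when the number of adopters is odd, this sketch puts the extra user in G2, which is an arbitrary choice.

```python
def split_adopters_by_time(adoption_times):
    """Split one app's adopters into early (G1) and late (G2) halves.

    adoption_times: dict mapping user index -> installation timestamp.
    """
    ordered = sorted(adoption_times, key=adoption_times.get)
    half = len(ordered) // 2
    return set(ordered[:half]), set(ordered[half:])

def visible_status(num_users, early):
    """Adoption vector seen at test time: only G1 members are marked as adopters."""
    return [1 if u in early else 0 for u in range(num_users)]

# Toy example: four adopters among five users, with made-up timestamps.
times = {3: 10.0, 0: 2.0, 4: 7.5, 1: 30.0}
g1, g2 = split_adopters_by_time(times)
x_visible = visible_status(5, g1)
```

The vector `x_visible` is what a predictor would receive as neighbor adoption status, while membership in `g2` serves as the ground truth to be predicted.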

Results from cross validation are shown in Table 4. We notice that our algorithm still maintains the best performance, with a limited decrease in accuracy compared with Table 3. Since the number of adopted nodes is smaller than in Table 3, we here show MP with smaller k in Table 4.

| Methods | MP-k, k=3 | MP-k, k=4 | MP-k, k=5 | $F_1$ Score |
|---|---|---|---|---|
| Our Approach | 0.18 | 0.16 | 0.15 | 0.35 |
| SVM-hybrid | 0.15 | 0.13 | 0.12 | 0.32 |
| Our (Regression) | 0.17 | 0.15 | 0.14 | 0.33 |
| Random | 0.045 | 0.045 | 0.045 | 0.090 |

Table 4: MP-k and $F_1$ scores for predicting future app installations are shown above.



Notice in Table 4 that the random guess precision is reduced by half. Therefore, even though the precision here is 30% lower than in Table 3, this is mainly due to the fact that nodes in G1 are no longer in the predicting set. Our accuracy is considerable, as it is four times better than random guess.



Predictions With Missing Historical Data 

In practice, sometimes it is not possible to observe the app installations of all users due to privacy reasons. Instead, app market makers may only be allowed to observe and instrument a small subset of a community. We here want to study if it is still possible to make some predictions of app installations under such circumstances.

To formally state this problem, we assume that all the nodes $1, \ldots, U$ are divided into two groups: the observable group G1 and the unobservable group G2. During cross validation, only nodes in the observable group are accessible to our algorithms in the training process, and nodes in the unobservable group are tested with the prediction algorithms. Therefore, for our algorithm, even though the individual variance $s_u, u \in$ G1 is computed in the training process, we will not have $s_{u'}, u' \in$ G2 for Eq. 5 in the testing phase. We illustrate the prediction precision results in Fig. 3. It seems that even when trained on a different set of subjects without calibrating user variance, the composite vector learned by our algorithm can still be applied to another set of users and achieve 80% over random guess.

[Figure 3 plot; x-axis: Percentage of All Subjects Used for Training]

Figure 3: The MP from our approach and two comparison approaches. We here set k for MP to be the average number of users in G2 for each testing app.

Conclusion

Our contributions in this paper include: a) We present data from a novel mobile phone based experiment on app installation behavior; b) We illustrate that there are strong network effects in app installation patterns even with tremendous uncertainty in app installation behavior; c) We show that by combining the networks measurable with modern smart phones, we can maximize the prediction accuracy; d) We develop a simple discriminative model which combines individual variance, multiple networks and exogenous factors, and our model provides prediction accuracy four times better than random guess in predicting future installations.

Future work includes the causality problem in studying network phenomena and a temporal model for app adoption. We believe the former can be addressed with much more carefully crafted lab experiments. For the latter, we have attempted multiple temporal adoption models but failed. We suspect that the mechanism of temporal diffusion of apps is very complicated, and we leave this as future work.

Though our convex optimization framework is fast and reasonably scalable, it should be noted that the method proposed in this paper may still not be suitable for handling data from billions of cell phone users. Potential solutions include dividing users into small clusters and then conquering, and sampling users for computation. The scalability problem remains future work.